{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "from pathlib import Path"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(ch:files_size)=\n",
    "# File Size\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Computers have finite resources. You have likely\n",
    "encountered these limits firsthand if your computer has slowed down from having\n",
    "too many applications open at once. We want to make sure that we do not\n",
    "exceed the computer's limits while working with data, and we might choose to examine a file differently depending on its size. If we know that our dataset is relatively small, then a text editor or a spreadsheet can be convenient for looking at the data. On the other hand, for large datasets, a more programmatic exploration or even distributed computing tools may be needed."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In many situations, we analyze datasets downloaded from the internet. These\n",
    "files reside on the computer's *disk storage*. In order to use Python to\n",
    "explore and manipulate the data, we need to read the data into the computer's\n",
    "*memory*, also known as random access memory (RAM). All Python code requires\n",
    "the use of RAM, no matter how short the code is.\n",
    "A computer's RAM is typically much smaller than its disk storage. For\n",
    "example, one computer model released in 2018 had 32 times more disk storage\n",
    "than RAM.  Unfortunately, this means that data files can often be much bigger\n",
    "than what is feasible to read into memory. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both disk storage and RAM capacity are measured in terms of *bytes*, where one byte is eight bits (0s and 1s). Roughly\n",
    "speaking, each character in a text file adds one byte to a file's size.\n",
    "To succinctly describe the sizes of larger files, we use the prefixes described\n",
    "in {numref}`byte-prefixes`. For example, a file containing 52,428,800 characters takes \n",
    "up $52,428,800 / 1,024^2 = 50~{\\textrm{mebibytes}}$, or 50 MiB on disk."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{table} Prefixes for common file sizes.\n",
    ":name: byte-prefixes\n",
    "\n",
    "| Multiple | Notation | Number of bytes |\n",
    "| -------- | -------- | --------------- |\n",
    "| Kibibyte | KiB      | 1,024    |\n",
    "| Mebibyte | MiB      | 1,024²   |\n",
    "| Gibibyte | GiB      | 1,024³   |\n",
    "| Tebibyte | TiB      | 1,024⁴   |\n",
    "| Pebibyte | PiB      | 1,024⁵   |\n",
    "\n",
    ":::\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "Why use multiples of 1,024 instead of simple multiples of 1,000 for these\n",
    "prefixes? This is a historical result of the fact that most computers\n",
    "use a binary number scheme where powers of 2 are simpler to represent ($1,024 = 2^{10}$). You also see the typical SI prefixes used to describe size---kilobytes, megabytes,\n",
    "and gigabytes, for example. Unfortunately, these prefixes are used\n",
    "inconsistently. Sometimes a kilobyte refers to 1,000 bytes; other times, a\n",
    "kilobyte refers to 1,024 bytes. To avoid confusion, we stick to kibi-,\n",
    "mebi-, and gibibytes, which clearly represent multiples of 1,024.\n",
    "\n",
    ":::"
   ]
  },
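  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conversion between bytes and the prefixes in {numref}`byte-prefixes` can be automated. The following is a minimal sketch, not a library function: the helper name `format_size` is our own. It divides a byte count by 1,024 until the value drops below 1,024 and then attaches the matching prefix:\n",
    "\n",
    "```python\n",
    "def format_size(num_bytes):\n",
    "    '''Return a byte count as a string with a binary prefix (hypothetical helper).'''\n",
    "    units = ['B', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB']\n",
    "    size = float(num_bytes)\n",
    "    for unit in units[:-1]:\n",
    "        if size < 1024:\n",
    "            return f'{size:.1f} {unit}'\n",
    "        size /= 1024\n",
    "    # Anything this large is reported in the biggest unit we list\n",
    "    return f'{size:.1f} {units[-1]}'\n",
    "\n",
    "print(format_size(52_428_800))  # 50.0 MiB\n",
    "```\n",
    "\n",
    "The printed value matches the 50 MiB worked out by hand for a file of 52,428,800 characters."
   ]
  },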
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is not uncommon to have a data file happily stored on a computer that will overflow the computer's memory if we attempt to\n",
    "manipulate it with a program. So we often begin our\n",
    "data work by making sure the files are of manageable size. To do this, we use the built-in `os` library:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File                      Size (KiB)\n",
      "data/inspections.csv      455.0\n",
      "data/co2_mm_mlo.txt       50.0\n",
      "data/violations.csv       3639.0\n",
      "data/DAWN-Data.txt        273531.0\n",
      "data/legend.csv           0.0\n",
      "data/businesses.csv       645.0\n"
     ]
    }
   ],
   "source": [
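    "from pathlib import Path\n",
    "import os\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "kib = 1024  # bytes per kibibyte\n",
    "line = '{:<25} {}'.format\n",
    "\n",
    "print(line('File', 'Size (KiB)'))\n",
    "# List every entry directly inside the data folder with its size on disk\n",
    "for filepath in Path('data').glob('*'):\n",
    "    size = os.path.getsize(filepath)  # size in bytes\n",
    "    print(line(str(filepath), np.round(size / kib)))"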
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the _businesses.csv_ file takes up 645 KiB on disk, making it well\n",
    "within the memory capacities of most systems. Although the _violations.csv_\n",
    "file takes up 3.6 MiB of disk storage, most machines can easily read it into a\n",
    "`pandas` `DataFrame` too. But _DAWN-Data.txt_, which contains the DAWN survey data,\n",
    "is much larger."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The DAWN file takes up roughly 270 MiB of disk storage, and while some computers\n",
    "can work with this file in memory, it can slow down other systems. To make\n",
    "this data more manageable in Python, we can, for example, load in a subset of the columns\n",
    "rather than all of them."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes we are interested in the total size of a folder\n",
    "instead of the size of individual files. For example, we have three restaurant files, \n",
    "and we might like to see whether we can combine all the data into a single dataframe. In the following code, we calculate the size of the _data_ folder, including all files in it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The data/ folder contains 271.80 MiB\n"
     ]
    }
   ],
   "source": [
    "mib = 1024**2\n",
    "\n",
    "total = 0\n",
    "for filepath in Path('data').glob('*'):\n",
    "    total += os.path.getsize(filepath) / mib\n",
    "\n",
    "print(f'The data/ folder contains {total:.2f} MiB')"
   ]
  },
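  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The pattern `glob('*')` matches only the entries directly inside the folder. If a folder contains nested subfolders, one option is to walk it recursively. The following is a sketch under the assumption that we want to count regular files only; the function name `folder_size` is our own:\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "def folder_size(folder):\n",
    "    '''Total size in bytes of all regular files under folder, including subfolders.'''\n",
    "    # rglob('*') descends into subfolders; is_file() skips the folders themselves\n",
    "    return sum(p.stat().st_size for p in Path(folder).rglob('*') if p.is_file())\n",
    "```\n",
    "\n",
    "For instance, `folder_size('data') / 1024**2` would give the size of the _data_ folder in MiB."
   ]
  },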
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ":::{note}\n",
    "\n",
    "As a rule of thumb, reading in a file using `pandas`\n",
    "usually requires available memory of at least five times\n",
    "the file size.\n",
    "For example, reading in a 1 GiB file typically requires at least 5 GiB of available memory.\n",
    "Memory is shared by all programs running on a computer, including the\n",
    "operating system, web browsers, and Jupyter notebook itself. A computer\n",
    "with 4 GiB total RAM might have only 1 GiB available RAM with many applications\n",
    "running. With 1 GiB available RAM, it is unlikely that `pandas` will be able to\n",
    "read in a 1 GiB file.\n",
    "\n",
    ":::\n",
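    "\n",
    "This rule of thumb is simple arithmetic, sketched below. The function name `likely_fits` and the default `factor=5` are our own encoding of the rule, and the amount of available memory is assumed to be supplied by the reader (for example, read off the operating system's task manager):\n",
    "\n",
    "```python\n",
    "def likely_fits(file_bytes, available_bytes, factor=5):\n",
    "    '''Rule of thumb: we want roughly factor times the file size in free memory.'''\n",
    "    return file_bytes * factor <= available_bytes\n",
    "\n",
    "gib = 1024**3\n",
    "print(likely_fits(1 * gib, 1 * gib))  # False\n",
    "print(likely_fits(1 * gib, 8 * gib))  # True\n",
    "```\n",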
    "\n",
    "There are several strategies for working with data that are far larger than what is feasible to load into memory. We describe a few of them next."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The popular term _big data_ generally refers to the scenario where the data are large enough that even top-of-the-line computers can't read the data directly into memory.\n",
    "This is a common scenario in scientific domains like astronomy, where telescopes capture images of space that can be pebibytes ($2^{50}$ bytes) in size. While their datasets are not quite as big, social media giants, health-care providers, and other companies can also struggle with large amounts of data.\n",
    "\n",
    "Figuring out how to draw insights from these datasets is an important research\n",
    "problem that motivates the fields of database engineering and distributed\n",
    "computing. While we won't cover these fields in this book, we provide a brief overview of basic approaches:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Subset the data\n",
    ": One simple approach is to work with portions of data. Rather than\n",
    "loading in the entire source file, we can either select a specific part of it\n",
    "(e.g., one day's worth of data) or randomly sample the dataset. Because\n",
    "of its simplicity, we use this approach quite often in this book. The natural\n",
    "downside is that we lose many of the benefits of analyzing a large\n",
    "dataset, like being able to study rare events."
   ]
  },
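  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As one illustration of random sampling without loading the whole file, the sketch below uses reservoir sampling. This standard technique is our choice for the example and is not necessarily how datasets are subset elsewhere in this book. It reads a CSV file one row at a time and keeps a uniform random sample of `k` rows. The function name `sample_rows` and its parameters are hypothetical, and the file is assumed to start with a header row:\n",
    "\n",
    "```python\n",
    "import csv\n",
    "import random\n",
    "\n",
    "def sample_rows(path, k, seed=None):\n",
    "    '''Return the header and a uniform random sample of up to k data rows.'''\n",
    "    rng = random.Random(seed)\n",
    "    sample = []\n",
    "    with open(path, newline='') as f:\n",
    "        reader = csv.reader(f)\n",
    "        header = next(reader)  # assumes the first row holds column names\n",
    "        for i, row in enumerate(reader):\n",
    "            if i < k:\n",
    "                sample.append(row)     # fill the reservoir first\n",
    "            else:\n",
    "                j = rng.randint(0, i)  # inclusive on both ends\n",
    "                if j < k:\n",
    "                    sample[j] = row    # happens with probability k / (i + 1)\n",
    "    return header, sample\n",
    "```\n",
    "\n",
    "Only `k` rows are held in memory at any time, so the approach works even when the file itself is much larger than RAM."
   ]
  },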
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a database system\n",
    ": As discussed in {numref}`Chapter %s <ch:sql>`,\n",
    "relational database management systems (RDBMSs) are specifically designed to\n",
    "store large datasets. SQLite is a useful system for working with datasets that\n",
    "are too large to fit in memory but small enough to fit on disk for a single\n",
    "machine. For datasets that are too large to fit on a single machine, more\n",
    "scalable database systems like MySQL and PostgreSQL can be used. These systems\n",
    "can manipulate data that are too big to fit into memory by using SQL queries.\n",
    "Because of their advantages, RDBMSs are commonly used for data storage in\n",
    "research and industry settings. One downside is that they often require a\n",
    "separate server for the data that needs its own configuration. Another downside\n",
    "is that SQL is less flexible in what it can compute than Python, which becomes\n",
    "especially relevant for modeling. A useful hybrid approach is to use SQL to\n",
    "subset, aggregate, or sample the data into batches that are small enough to\n",
    "read into Python. Then we can use Python for more sophisticated analyses."
   ]
  },
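  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The hybrid approach might look like the following sketch. It uses Python's built-in `sqlite3` module, with a small in-memory database standing in for a large one on disk. The table name `inspections`, its columns, and its values are made up for illustration:\n",
    "\n",
    "```python\n",
    "import sqlite3\n",
    "\n",
    "# ':memory:' keeps the demo self-contained; a real database would be a file on disk\n",
    "con = sqlite3.connect(':memory:')\n",
    "con.execute('CREATE TABLE inspections (business_id INTEGER, score INTEGER)')\n",
    "con.executemany(\n",
    "    'INSERT INTO inspections VALUES (?, ?)',\n",
    "    [(1, 90), (1, 80), (2, 70), (2, 100), (3, 95)],\n",
    ")\n",
    "\n",
    "# Let the database aggregate, so only one row per business reaches Python\n",
    "query = '''\n",
    "    SELECT business_id, AVG(score) AS avg_score\n",
    "    FROM inspections\n",
    "    GROUP BY business_id\n",
    "    ORDER BY business_id\n",
    "'''\n",
    "avg_scores = con.execute(query).fetchall()\n",
    "con.close()\n",
    "\n",
    "print(avg_scores)  # [(1, 85.0), (2, 85.0), (3, 95.0)]\n",
    "```\n",
    "\n",
    "The aggregated result is small enough to analyze further in Python, even when the underlying table is not."
   ]
  },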
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a distributed computing system\n",
    ": Another approach to handle complex\n",
    "computations on large datasets is to use a distributed computing system like\n",
    "MapReduce, Spark, or Ray.\n",
    "These systems work best on tasks that can be split into many smaller parts:\n",
    "they divide a dataset into smaller pieces and run programs on\n",
    "all of the pieces at once.\n",
    "These systems have great flexibility and can be used\n",
    "in a variety of scenarios. Their main downside is\n",
    "that they can require a lot of work to install and configure properly\n",
    "because they are typically installed across many computers that need to\n",
    "coordinate with one another."
   ]
  },
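  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The split-and-combine idea behind these systems can be imitated on one machine. The sketch below is only a toy: it uses a thread pool from Python's built-in `concurrent.futures` module in place of a cluster, and the helper names `count_words` and `word_counts` are our own. Real systems such as Spark add data distribution, scheduling, and fault tolerance on top of this pattern:\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "from concurrent.futures import ThreadPoolExecutor\n",
    "\n",
    "def count_words(lines):\n",
    "    '''Map step: count the words in one piece of the data.'''\n",
    "    return Counter(word for line in lines for word in line.split())\n",
    "\n",
    "def word_counts(lines, n_pieces=4):\n",
    "    '''Split the lines into pieces, count each piece, then combine the counts.'''\n",
    "    pieces = [lines[i::n_pieces] for i in range(n_pieces)]\n",
    "    with ThreadPoolExecutor(max_workers=n_pieces) as pool:\n",
    "        partial_counts = pool.map(count_words, pieces)\n",
    "        # Reduce step: merge the per-piece counts into one result\n",
    "        return sum(partial_counts, Counter())\n",
    "\n",
    "print(word_counts(['a b a', 'b c', 'a']))\n",
    "```\n",
    "\n",
    "Each piece is counted independently, which is what allows a distributed system to hand the pieces to different computers."
   ]
  },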
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It can be convenient to use Python to determine a file's format, encoding, and size.\n",
    "Another powerful tool for working with files is the shell; the shell is widely\n",
    "used and has a more succinct syntax than Python.\n",
    "In the next section, we introduce a few command-line tools available in the shell for carrying out the same tasks of finding out information about a file before reading it into a dataframe."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
